
Initial integration of TUF client into Nexus #469

Closed · iliana wants to merge 14 commits

Conversation

@iliana (Contributor) commented Dec 1, 2021

RFD 183 recommends the use of TUF as a transport mechanism to get updates from Oxide to the rack for a few reasons, one of which is "I wrote a TUF client once already". This is the initial integration work to add tough to Nexus.

I am now, of course, paying the costs of decisions I made two years ago. tough is not yet async (since it was written right as async/.await was starting to stabilize), RepositoryBuilder and Repository are not Send, and tough has a defensively-built API, intended for systems that have few mechanisms to handle threats to an update system. So there's this awkward tokio::task::spawn_blocking boundary between tough and the rest of Nexus.
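
Roughly, that boundary has the following shape (a sketch only; the function names and types here are placeholders, not the actual Nexus code): all tough work stays in a synchronous function, and the async side only reaches it through spawn_blocking.

use tokio::task;

// All tough work (building the Repository, reading artifacts.json) stays in a
// synchronous function, since Repository is not Send and reqwest's blocking
// client must not run on an async runtime thread.
fn read_artifacts_blocking() -> Result<Vec<String>, String> {
    // placeholder for: load the TUF repo with tough and parse artifacts.json
    Ok(vec!["cockroach".to_string()])
}

async fn updates_refresh() -> Result<Vec<String>, String> {
    // the async side crosses the boundary via spawn_blocking
    task::spawn_blocking(read_artifacts_blocking)
        .await
        .map_err(|e| e.to_string())?
}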

Note the update URLs are currently hardcoded to paths on http://localhost:8000 because Nexus doesn't fetch the Rack from the database yet.

Quick top-level overview

artifacts.json is a target in the TUF repository we build. Each entry describes an update available for a specific "thing" (in this case, the not-yet-existing CockroachDB zone): a name, an integer version number, what kind of thing it is (in this case, a zone), and the name of the target as listed in the targets.json TUF role.

{
    "version": "1",
    "artifacts": [
        {
            "name": "cockroach",
            "version": 1,
            "kind": "zone",
            "target": "whatever.tar.gz"
        }
    ]
}
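
For illustration, the corresponding Rust types might look roughly like this when deserialized with serde (the type names, and using a plain String for kind, are illustrative rather than the actual Nexus types):

use serde::Deserialize;

#[derive(Deserialize, Debug)]
struct ArtifactsDocument {
    // the top-level "version" is a string in the example above
    version: String,
    artifacts: Vec<UpdateArtifact>,
}

#[derive(Deserialize, Debug)]
struct UpdateArtifact {
    // name of the "thing" being updated, e.g. "cockroach"
    name: String,
    // integer version of that thing
    version: i64,
    // what kind of thing it is, e.g. "zone"
    kind: String,
    // name of the target in the targets.json TUF role
    target: String,
}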

This code adds an /updates/refresh endpoint to more-or-less copy the data from this file (and the target's checksum/size) into a database table.

The goal is for Nexus to be aware of what updates are available for different "things", and be able to download those while adhering to the TUF specification (not fetching if the metadata is expired, and checking the checksum and size), but not having to re-fetch the metadata every time it wants to fetch an artifact.

(I am going to work on artifact fetching next, and will either append it to this PR or put it in a follow-up PR.)

nexus is async; tough is not, but tough uses reqwest in blocking
mode, which wraps the async client and must not run inside an async runtime.

https://docs.rs/reqwest/0.11.7/reqwest/blocking/index.html:

> Conversely, the functionality in `reqwest::blocking` must not be
> executed within an async runtime, or it will panic when attempting to
> block.

moves the relevant code out to updates.rs, since the function can't
borrow self due to lifetime constraints.
@iliana requested a review from @smklein, December 1, 2021 20:57
@smklein (Collaborator) left a comment

Awesome start, super cool to see this coming together!

I think your judicious use of spawn_blocking to decouple from the tough async constraints was a good call - the rest of the PR looks mostly unaffected by that part of the codebase.


[updates]
# If not present, accessing the TUF updates repository will fail
#tuf_trusted_root = "/opt/oxide/nexus/pkg/root.json"
Collaborator:

Where would this root.json come from? I'm noticing this part of Nexus config present-but-commented-out everywhere, sorta wondering what the plan is gonna be for getting this running in a dev/test environment.

@iliana (author) commented Dec 1, 2021:

I should probably write up a script for generating one of these, but copying and pasting from history:

tuftool root init 1.root.json
tuftool root gen-rsa-key 1.root.json ~/test.key -r root snapshot targets timestamp
[ manually edit a few things in 1.root.json, namely expiration and key thresholds ]
tuftool root sign -k ~/test.key 1.root.json

then creating a repository from the files in ~/in:

tuftool create -k ~/test.key --outdir repo --root 1.root.json \
    --snapshot-expires 'in 3 days' --targets-expires 'in 3 days' --timestamp-expires 'in 3 days' \
    --snapshot-version $(date +%s) --targets-version $(date +%s) --timestamp-version $(date +%s) \
    --add-targets ~/in

Comment on lines +1877 to +1879
/**
* Refresh update metadata
*/
Collaborator:

This seems like a good candidate for an addition to https://github.com/oxidecomputer/omicron/blob/main/tools/oxapi_demo , as a convenience

}]
async fn updates_refresh(
    rqctx: Arc<RequestContext<Arc<ServerContext>>>,
    // _query_params: Query<PaginatedById>,
Collaborator:

Dead code?

Comment on lines 2277 to 2278
// first, the JoinError:
.map_err(|e| Error::InternalError { internal_message: e.to_string() })?
Collaborator:

This only happens when the call to read_artifacts panics, right?

  1. I think we're using panic=abort, so this shouldn't happen,
  2. ... if it did, panicking ASAP seems like the right behavior.

In other codebases, I just unwrapped the first error for this reason: https://github.com/oxidecomputer/async-bb8-diesel/blob/09f17db081be85abdf4f773a385381637b9c8d76/src/connection.rs#L42-L45
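
A minimal sketch of that suggestion (placeholder function and error types, not the actual Nexus error handling): treat the JoinError as a panic to propagate rather than an error to wrap.

use tokio::task;

// Placeholder for the blocking tough call.
fn read_artifacts() -> Result<Vec<String>, String> {
    Ok(vec![])
}

async fn refresh() -> Result<Vec<String>, String> {
    task::spawn_blocking(read_artifacts)
        .await
        // A JoinError here only means read_artifacts panicked (or the task
        // was cancelled), so panic instead of mapping it to an InternalError.
        .expect("read_artifacts task panicked")
}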

.map_err(|e| Error::InternalError { internal_message: e.to_string() })?
// next, the boxed dyn Error:
.map_err(|e| Error::InternalError {
    internal_message: e.to_string(),
Collaborator:

Nit: Could we add a little context here - just something like "we were trying to refresh updates, and saw {}"

Comment on lines +2284 to +2286
// FIXME: if we hit an error in any of these database calls, the available artifact table
// will be out of sync with the current artifacts.json. can we do a transaction or
// something?
Collaborator:

> can we do a transaction or something

Yeah, there's an example here: https://github.com/oxidecomputer/async-bb8-diesel/blob/22c26ef840075a57b18ac2eae3ead989b992773e/examples/usage.rs#L72-L82

If we wanted to do this, we would probably have a call to self.db_datastore that takes all the artifacts we wanna upsert at once, so the transaction could be self-contained in the datastore function.

All that being said... the upserts are idempotent, right? We could have a separate single-column table for targets_role_version, which we update once all the necessary artifacts are updated. Then the steps would be:

  1. upsert all the artifacts
  2. update the "desired" targets role version
  3. delete all older artifacts

And I don't think they'd need to be transactional anymore.
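
As a rough sketch of that ordering (a much-simplified in-memory stand-in for the datastore, just to show the three steps; the real thing is CockroachDB via the Nexus datastore):

use std::collections::BTreeMap;

#[derive(Clone, Debug)]
struct Artifact {
    name: String,
    targets_role_version: i64,
}

#[derive(Default)]
struct Store {
    artifacts: BTreeMap<String, Artifact>,
    desired_targets_role_version: Option<i64>,
}

impl Store {
    fn refresh(&mut self, new_artifacts: Vec<Artifact>, version: i64) {
        // 1. upsert all the artifacts (idempotent, safe to retry)
        for artifact in new_artifacts {
            self.artifacts.insert(artifact.name.clone(), artifact);
        }
        // 2. update the "desired" targets role version
        self.desired_targets_role_version = Some(version);
        // 3. delete all artifacts from older targets role versions
        self.artifacts.retain(|_, a| a.targets_role_version == version);
    }
}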

Either way, if you wanna punt on this until after the PR is merged, could we file a bug to track? DB concurrency shenanigans are hairy, but I don't want us to lose track of 'em.

Collaborator:

How is this function called? If it's from an API endpoint, and the API request fails, then at least the caller knows it didn't work and we're basically saying it's their responsibility to retry until it works.

Is it also a problem that it might be inconsistent? That is, we might have updated some of the artifacts but not all of them?

@iliana (author):

If it's inconsistent, then either Nexus will be missing information about a newer update, or might try to update to a removed artifact (if we ever have to do that for some reason).

Comment on lines +1984 to +1985
tuf_metadata_base_url: "http://localhost:8000/metadata".to_string(),
tuf_targets_base_url: "http://localhost:8000/targets".to_string(),
Collaborator:

These might also be good candidates for the Nexus config file?

Also, should we have instructions on "how to set this up"? I don't think these servers are running on most people's machines.

@iliana (author):

I've just been running python3 -m http.server in the output directory from tuftool. (Instructions for making a repo probably should go along with that.)

These should maybe go in the Nexus config file to start, but they'll eventually need to be configurable by an operator... thoughts?

Collaborator:

What's the longer term plan here? Are we building a new component to serve this endpoint? Is it part of the control plane? If not, maybe we want to treat this like we do Clickhouse and CockroachDB, with easy ways to run it by hand that we use in the test suite to get coverage of this stuff?

@iliana (author):

Longer-term plan is for this to default to something.oxide.computer via HTTPS and be stored in the database, ready for an end-user to change it if they need to.


let mut current_version = None;
for artifact in artifacts {
    current_version = Some(artifact.targets_role_version);
Collaborator:

As a sanity-check -- if we see multiple values of artifact.targets_role_version, they should all be the same, right? (trying to mentally verify that the order of artifacts couldn't change behavior)

@iliana (author):

Yeah, these will all be the same.

This is kind of messy, it's a last minute "well crap I need this value" that we can clean up when moving the upsert and delete operations into the same transaction.

.update_available_artifact_hard_delete_outdated(current_version)
.await?;
}

Collaborator:

When do you think we'd actually try to apply updates (basically, making the calls to POST /update on each sled) to the rest of the rack? Would that be at the end of this function, or some other time?

@iliana (author):

For the demo, probably immediately just for simplicity's sake?

At release I think a background task that periodically checks for out-of-date artifacts on the rack and starts applying updates is probably correct. Otherwise we need to handle applying updates here as well as when a sled-agent comes online (e.g. brand new sled, or sled that's been powered off for a few weeks).
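
For what it's worth, a sketch of what such a background task might look like (the interval and function names are made up for illustration; this would be spawned once at Nexus startup with tokio::spawn):

use std::time::Duration;
use tokio::time;

// Hypothetical periodic checker: refresh update metadata, then apply updates
// to anything that's out of date. Both steps are placeholders here.
async fn update_check_task() {
    let mut ticker = time::interval(Duration::from_secs(300));
    loop {
        ticker.tick().await;
        if let Err(e) = refresh_and_apply_updates().await {
            eprintln!("update check failed: {e}");
        }
    }
}

async fn refresh_and_apply_updates() -> Result<(), String> {
    // placeholder for: refresh the available-artifact table from the TUF repo,
    // compare against what each sled is running, and POST /update where needed
    Ok(())
}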

pub name: String,
pub version: i64,
pub kind: UpdateArtifactKind,
pub target: String,
Collaborator:

(... I feel like I may have asked you this before, but I can't remember the answer)

What is this field? Is it a reference to the targets.json file somehow? If it's the name of the artifact, how is it different from name?

Related: Could we add some docs?

@iliana (author):

It's the name of the target in targets.json, yeah. I should do a docs / "is this name sensible" pass again.
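
For the docs pass, something along these lines might be enough (the struct name and wording are illustrative; the fields match the snippet above):

// stub for illustration
pub enum UpdateArtifactKind {
    Zone,
}

pub struct UpdateArtifact {
    /// User-facing name of the thing being updated, e.g. "cockroach".
    pub name: String,
    /// Version of the artifact itself (not the TUF targets role version).
    pub version: i64,
    /// What kind of thing this artifact updates, e.g. a zone.
    pub kind: UpdateArtifactKind,
    /// Name of the corresponding target in the targets.json TUF role, used to
    /// locate and verify the artifact in the repository.
    pub target: String,
}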

@davepacheco (Collaborator) left a comment

As Sean said, this looks like a great start!

mode = "file"
path = "logs/server.log"
if_exists = "append"
#mode = "file"
Collaborator:

This file was supposed to be exactly the same as nexus/examples/config.toml, except it logs to a file instead of stdout. I think maybe you didn't mean to comment out this section?

pub struct UpdatesConfig {
    /** Trusted root.json role for the TUF updates repository. If `None`, accessing the TUF
     * repository will fail. */
    pub tuf_trusted_root: Option<PathBuf>,
Collaborator:

Is there any deployment where we might want to leave this out? (i.e., why not require this? is it a pain for development?)

@iliana (author):

I imagine this could be temporary, but given all the work going on right now I didn't want to break folks' testing workflows. There should not be production deployments where this is somehow left unconfigured.

@@ -330,6 +340,7 @@ mod test {
                 .parse()
                 .unwrap()
             },
+            updates: UpdatesConfig { tuf_trusted_root: None },
Collaborator:

I think this test fills everything out instead of leaving anything optional. It seems useful to have at least one test that does that for the updates section.

@@ -142,8 +144,8 @@ impl Nexus {
* Create a new Nexus instance for the given rack id `rack_id`
*/
/* TODO-polish revisit rack metadata */
pub fn new_with_id(
rack_id: &Uuid,
pub async fn new_with_id(
Collaborator:

Wondering out loud: I've been trying to keep this sync because it's tempting to put async things in here that belong as background processes (e.g., saga recovery or database initialization). I don't fully grok the flow of the root JSON file but it looks like this is reasonable to be done immediately. But I wonder if the (async) caller should pass in the file contents instead?

@iliana (author):

We could also keep it sync and just use std::fs::read. Root roles are generally expected to be very small (hence why we're just keeping it in memory).

Alternatively we can just pass along the PathBuf of the root to Nexus, and Nexus can read it whenever /updates/refresh is hit.
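
Both options are pretty small; a sketch (paths and names illustrative):

use std::path::{Path, PathBuf};

// Option 1: keep new_with_id sync and read the (small) trusted root up front.
fn load_trusted_root(path: &Path) -> std::io::Result<Vec<u8>> {
    std::fs::read(path)
}

// Option 2: hand Nexus the path and read it each time /updates/refresh runs.
struct UpdatesContext {
    trusted_root_path: PathBuf,
}

impl UpdatesContext {
    fn read_trusted_root(&self) -> std::io::Result<Vec<u8>> {
        std::fs::read(&self.trusted_root_path)
    }
}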

@iliana closed this Mar 3, 2022
@iliana deleted the nexus-update-refresh branch May 16, 2024 21:54